Predicting Wildfires¶

By: Eugenia Poon, Kayla Zhu

Motivation: California's worsening wildfires in recent years have destroyed ecosystems, magnified air pollution and health risks, and strained labor and resources.

Goal: predict the risk, presence, and spread of fire

Table of Contents¶

  • Load and Clean Data

    • Shapefile
    • GRIDMET
    • Interpolation
  • Feature Engineering

    • Fire Column
    • Nearby Extreme Data
    • Categorize Burn Index
  • Exploratory Data Analysis

  • Modeling

    • Model 1: Classification Using Fire Perimeter (failed)
    • Model 2: Classification Using Burn Index Classes
    • Model 3: Regression Using Burn Index
    • Model 4: Regression + Classification Using Neural Network
  • Bootstrap

  • Final Model

  • Conclusion

  • wildfireFunc.py

In [1]:
%matplotlib inline
import os
import os.path
import random
import warnings
import zipfile

import geopandas as gpd
import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf
import xarray as xr
from IPython.display import HTML, Javascript, display
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from matplotlib import pyplot as plt
from plotly import express as px, graph_objects as go
from sklearn.metrics import (
    balanced_accuracy_score,
    f1_score,
    mean_absolute_error,
    mean_squared_error,
    precision_recall_fscore_support,
)
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.layers import Dense, Input, Normalization
from tensorflow.keras.models import Model
from tensorflow.keras.utils import to_categorical
from xgboost import XGBClassifier

from wildfireFunc import (
    getDF,
    getFireRange,
    getInter,
    getNC,
    plotInter,
    plotRes,
    model,
    plot_loss,
    score,
    split,
)


warnings.filterwarnings('ignore')
plt.rcParams['figure.figsize'] = (6, 4)
plt.rcParams['figure.dpi'] = 400
plt.rcParams['font.size'] = 5
plt.rcParams['figure.titlesize'] = 10
plt.rcParams['axes.linewidth'] = 0.1
plt.rcParams['patch.linewidth'] = 0

display(HTML("<style>.container { width:100% !important; }</style>"))
random.seed(100)


tf.random.set_seed(100)
np.random.seed(100)

Load and Clean Data¶

Shapefiles¶

table of contents

  • Only Northern Sierra Nevada: Tahoe, Plumas, and Lassen National Forest
  • How to Query
                      Sierra Nevada Boundary              California Fire Perimeters
SOURCE                Sierra Nevada Conservancy Boundary  CAL FIRE
OPEN                  View Data Source                    View Data Source
OPEN                                                      California Fire Perimeters (1950+)
OPEN                  Query                               Query
Where                 1=1                                 YEAR_=2021
Geometry Type         Polygon                             Polygon
Spatial Relationship  Intersects                          Intersects
Units                 Meters                              Meters
Out fields            *                                   *
Return Geometry       True                                True
Format                GeoJSON                             GeoJSON
CLICK                 Query (GET)                         Query (GET)
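The query settings above can also be assembled programmatically instead of hand-editing the long URL used in the next cell. A minimal sketch with the standard library, using the same endpoint and the parameter names of the ArcGIS REST `query` operation (only the parameters from the table above are shown; the service fills in defaults for the rest):

```python
from urllib.parse import urlencode

# Base endpoint for California Fire Perimeters (same service queried below)
BASE = ('https://services1.arcgis.com/jUJYIo9tSA7EHvfZ/ArcGIS/rest/services/'
        'California_Fire_Perimeters/FeatureServer/2/query')

params = {
    'where': 'YEAR_=2021',                    # filter to 2021 fires
    'geometryType': 'esriGeometryPolygon',
    'spatialRel': 'esriSpatialRelIntersects',
    'units': 'esriSRUnit_Meter',
    'outFields': '*',                         # return all attribute fields
    'returnGeometry': 'true',
    'f': 'pgeojson',                          # GeoJSON, so geopandas can read it
}
url = BASE + '?' + urlencode(params)          # pass to gpd.read_file(url)
```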
In [2]:
# sierra nevada
url = 'https://services1.arcgis.com/sTaVXkn06Nqew9yU/arcgis/rest/services/Sierra_Nevada_Conservancy_Boundary/FeatureServer/0/query?where=1%3D1&objectIds=&time=&geometry=&geometryType=esriGeometryPolygon&inSR=&spatialRel=esriSpatialRelIntersects&resultType=none&distance=&units=esriSRUnit_Meter&relationParam=&returnGeodetic=false&outFields=*&returnGeometry=true&returnCentroid=false&featureEncoding=esriDefault&multipatchOption=xyFootprint&maxAllowableOffset=&geometryPrecision=&outSR=&defaultSR=&datumTransformation=&applyVCSProjection=false&returnIdsOnly=false&returnUniqueIdsOnly=false&returnCountOnly=false&returnExtentOnly=false&returnQueryGeometry=false&returnDistinctValues=false&cacheHint=false&orderByFields=&groupByFieldsForStatistics=&outStatistics=&having=&resultOffset=&resultRecordCount=&returnZ=false&returnM=false&returnExceededLimitFeatures=true&quantizationParameters=&sqlFormat=none&f=pgeojson&token='
snBoundaries = gpd.read_file(url).to_crs(epsg=4326)
snBounds = snBoundaries.geometry.bounds

bounds = [-122, 39.35, snBounds.iloc[0].maxx, 41]
boundaries = gpd.clip(snBoundaries, bounds)
bounds = boundaries.geometry.bounds
display(snBounds)

# fire perimeter
url1 = 'https://services1.arcgis.com/jUJYIo9tSA7EHvfZ/ArcGIS/rest/services/California_Fire_Perimeters/FeatureServer/2/query?where=YEAR_%3D'
url2 = '&objectIds=&time=&geometry=&geometryType=esriGeometryPolygon&inSR=&spatialRel=esriSpatialRelIntersects&resultType=none&distance=&units=esriSRUnit_Meter&relationParam=&returnGeodetic=false&outFields=*&returnGeometry=true&returnCentroid=false&featureEncoding=esriDefault&multipatchOption=xyFootprint&maxAllowableOffset=&geometryPrecision=&outSR=&defaultSR=&datumTransformation=&applyVCSProjection=false&returnIdsOnly=false&returnUniqueIdsOnly=false&returnCountOnly=false&returnExtentOnly=false&returnQueryGeometry=false&returnDistinctValues=false&cacheHint=false&orderByFields=&groupByFieldsForStatistics=&outStatistics=&having=&resultOffset=&resultRecordCount=&returnZ=false&returnM=false&returnExceededLimitFeatures=true&quantizationParameters=&sqlFormat=none&f=pgeojson&token='
years = ['2020', '2021']
firePer = pd.concat([gpd.read_file(url1 + y + url2) for y in years])
dateCol = ['ALARM_DATE', 'CONT_DATE']
firePer = firePer[firePer.STATE == 'CA'].dropna(subset=dateCol)
firePer.geometry = firePer.geometry.make_valid()  # 2021 has invalid geometry
firePer = gpd.clip(firePer, boundaries)  # fires in area
firePer[dateCol] = firePer[dateCol].apply(pd.to_datetime, unit='ms')
firePer = firePer.sort_values('ALARM_DATE').reset_index(drop=True)
print(f'{len(firePer)} fires')
display(firePer.sort_values('Shape__Area')[
        ['FIRE_NAME', 'Shape__Area']].tail())
   minx         miny       maxx         maxy
0  -123.039382  35.064862  -117.735384  41.994919

64 fires

    FIRE_NAME         Shape__Area
26  LOYALTON          3.198601e+08
41  W-5 COLD SPRINGS  6.049077e+08
58  SUGAR             7.236506e+08
34  NORTH COMPLEX     2.183903e+09
59  DIXIE             6.692866e+09
In [3]:
fig, ax = plt.subplots(nrows=1, ncols=1, dpi=600)
snBoundaries.plot(ax=ax, color='grey', alpha=0.1, linewidth=0)
boundaries.plot(ax=ax, color='grey', alpha=0.3, linewidth=0)
firePer.plot('YEAR_', legend=True, alpha=0.6, linewidth=0, ax=ax)
plt.tick_params(axis='both', labelsize=5)
ax.set(xlabel='longitude', ylabel='latitude')
plt.show()
[figure: 2020-2021 fire perimeters plotted over the clipped Sierra Nevada boundary]

Geospatial Data: GRIDMET¶

table of contents

  • GRIDMET (Download NetCDF Data)
  • Clip 2021 and 2022 GRIDMET data to boundaries within Tahoe, Plumas, and Lassen National Forest area in Sierra Nevada

Description¶

  • What starts a fire? high winds, high temperatures, low humidity
  • Years: 2020-2021
  • Resolution: 4638.3 meters = 4.6383 km = 1/24 degrees
  • Temporal Resolution: daily
  • Key inputs into the NFDRS model are: fuels, weather, topography and risks.
variable           var   units  description
relative_humidity  rmin  %      minimum relative humidity
air_temperature    tmmx  K      maximum temperature
wind_speed         vs    m/s    wind velocity at 10 m
burning_index_g    bi    -      NFDRS fire danger burning index
In [4]:
path = 'data/gm.nc'
if not os.path.isfile(path+'.zip'):  # ~3 minutes
    getNC(path, bounds, boundaries, years)
    !rm data/gm.nc
ds = xr.open_dataset(zipfile.ZipFile(path+'.zip').open('gm.nc'))
print(ds.dims)
Frozen({'lon': 52, 'lat': 44, 'day': 731})

Interpolation¶

table of contents

  • Used for testing with lower-resolution data: a higher factor yields less data
  • To get a lower resolution, choose a higher factor
  • Lowering the resolution can produce misleadingly good performance scores, because each pixel generalizes over a larger area
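`getInter` is project code in wildfireFunc.py; as a rough illustration of what lowering resolution by an integer factor amounts to, here is a minimal numpy sketch that block-averages each factor x factor group of pixels into one coarser pixel (an assumption about the approach, not the actual implementation):

```python
import numpy as np

def coarsen(grid, factor):
    """Average non-overlapping factor x factor blocks into single pixels."""
    h, w = grid.shape
    h, w = h - h % factor, w - w % factor        # trim edges that don't divide evenly
    blocks = grid[:h, :w].reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))

grid = np.arange(16, dtype=float).reshape(4, 4)
low = coarsen(grid, 2)   # 4x4 grid -> 2x2 grid, each value a block mean
```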
In [5]:
factor = 1
dd = getInter(ds, factor)
print(np.unique(abs(np.diff(dd.lon.to_numpy()))),
      np.unique(abs(np.diff(dd.lat.to_numpy()))))
print()
res = abs(np.diff(dd.lon.to_numpy()))[0]
display(dd)
[0.04166667 0.04166667] [0.04166667 0.04166667]

<xarray.Dataset>
Dimensions:            (lon: 52, lat: 44, day: 731)
Coordinates:
  * lon                (lon) float64 -122.1 -122.0 -122.0 ... -120.0 -119.9
  * lat                (lat) float64 41.07 41.03 40.98 ... 39.36 39.32 39.28
  * day                (day) datetime64[ns] 2020-01-01 2020-01-02 ... 2021-12-31
Data variables:
    burning_index_g    (day, lat, lon) float32 ...
    relative_humidity  (day, lat, lon) float32 ...
    air_temperature    (day, lat, lon) float32 ...
    wind_speed         (day, lat, lon) float32 ...
Attributes:
    units:              Unitless
    description:        BI-G
    long_name:          bi
    standard_name:      bi
    dimensions:         lon lat time
    grid_mapping:       crs
    coordinate_system:  WGS84,EPSG:4326
In [6]:
plotInter(firePer, ds)

Feature Engineering¶

Create Fire Column Using Fire Perimeter¶

table of contents

  • Indicate whether a coordinate had a fire on a given date, based on CAL FIRE's perimeter data
  • Method:
    • coordinates (lon, lat) are pixel centers; create bigger polygons (squares) around each coordinate to cover the area around it
    • check whether a fire was active on the date (between its alarm and containment dates)
      • True: yes fire
      • False: no fire
    • check whether the polygon intersects the perimeter of a fire active on that date
      • True: yes fire on that date at that coordinate
      • False: no fire
  • How to get intersection
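The bullets above can be sketched in plain Python. This simplification uses axis-aligned bounding boxes instead of the real perimeter polygons handled by `getDF` (the helper names, the sample bounding box, and the dates below are hypothetical, for illustration only):

```python
from datetime import date

def pixel_square(lon, lat, res):
    """Square of side `res` centred on a grid coordinate: (minx, miny, maxx, maxy)."""
    half = res / 2
    return (lon - half, lat - half, lon + half, lat + half)

def boxes_overlap(a, b):
    """Axis-aligned box intersection test."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def had_fire(lon, lat, day, fire_box, alarm, cont, res=1/24):
    """Fire at this pixel on this day: date within alarm..containment AND boxes overlap."""
    return alarm <= day <= cont and boxes_overlap(pixel_square(lon, lat, res), fire_box)

# hypothetical fire extent and dates, roughly in the study area
fire_box = (-121.4, 40.0, -121.0, 40.3)
on = had_fire(-121.39, 40.05, date(2021, 7, 20),
              fire_box, date(2021, 7, 14), date(2021, 10, 25))
```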
In [7]:
path = 'data/gm.pkl'
if not os.path.isfile(path+'.zip'):
    getDF(path, res, dd, firePer)
    !rm data/gm.pkl

df = pd.read_pickle(zipfile.ZipFile(path+'.zip').open('gm.pkl'))
df['coor'] = df.groupby(['lat', 'lon']).ngroup()
display(df.info(verbose=False))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1319455 entries, 0 to 1319454
Columns: 10 entries, lon to coor
dtypes: datetime64[ns](1), float32(4), float64(2), int64(3)
memory usage: 80.5 MB
None
In [8]:
# get data for 3 days before
fire = df.copy()
cols = ['burning_index_g', 'relative_humidity',
        'air_temperature', 'wind_speed']
shifted = []
for i in np.arange(1, 4):
    temp = fire.groupby('coor')[cols].shift(i)
    temp = temp.add_suffix(i)
    shifted.append(temp)
shifted = pd.concat(shifted, axis=1).bfill()  # backfill the first days with no lagged data
fire = pd.concat([fire, shifted], axis=1)  # keep current
# fire = pd.concat([fire.drop(columns=cols), shifted], axis=1) # remove current
fire = fire.reindex(sorted(fire.columns), axis=1)

Get Nearby Extreme Data¶

table of contents

For every coordinate of day x:

  • get the surrounding coordinates (pixels around a pixel)
  • get extreme data for two days before (day x-1 and x-2) for surrounding coords and the coord
    • max air temperatures
    • max burning index
    • max wind speed
    • min relative humidity
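Taking the max (or min) over a pixel's surrounding pixels, as `getSurrounding` does via a groupby, is equivalent to a 3x3 neighborhood maximum over the grid. A minimal numpy sketch of that idea, with the center pixel excluded as in the notebook's version:

```python
import numpy as np

def neighbour_max(grid):
    """Max over each pixel's 8 neighbours (centre excluded), edges padded with -inf."""
    p = np.pad(grid, 1, constant_values=-np.inf)
    h, w = grid.shape
    shifts = [p[1 + dy: 1 + dy + h, 1 + dx: 1 + dx + w]
              for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dy, dx) != (0, 0)]
    return np.max(shifts, axis=0)

grid = np.array([[1., 2., 3.],
                 [4., 9., 5.],
                 [6., 7., 8.]])
out = neighbour_max(grid)   # out[1, 1] excludes the centre value 9
```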
In [9]:
ll = df[['lat', 'lon']].drop_duplicates()


def getSurrounding(lat, lon, ll=ll, fire=fire):
    n, s = lat + res, lat - res
    e, w = lon + res, lon - res
    ll = ll[(ll.lat <= n) & (ll.lat >= s) & (ll.lon <= e) & (ll.lon >= w)]
    temp = fire[fire.lat.isin(ll.lat) & fire.lon.isin(ll.lon)]
    # remove the center pixel itself, keeping only the surrounding pixels
    temp = temp[~((temp.lat == lat) & (temp.lon == lon))]
    temp['lat'] = lat
    temp['lon'] = lon
    return temp


# get surrounding coordinates
f = fire.loc[:, ~fire.columns.str.endswith('3')]
ex = pd.concat([getSurrounding(lat=g[0][0], lon=g[0][1])
                for g in f.groupby(['lat', 'lon'])])


# get extreme data for 2 days before (surrounding)
extreme = (ex.groupby(['lat', 'lon', 'day'])
           .agg(at1=('air_temperature1', max),
                at2=('air_temperature2', max),
                bi1=('burning_index_g1', max),
                bi2=('burning_index_g2', max),
                rh1=('relative_humidity1', min),
                rh2=('relative_humidity2', min),
                ws1=('wind_speed1', max),
                ws2=('wind_speed2', max),))
extreme = (f.merge(extreme.reset_index(), on=['lat', 'lon', 'day'])
           .drop(columns=['fire', 'lon', 'lat', 'relative_humidity',
                          'air_temperature', 'wind_speed']))

Categorize Burn Index¶

table of contents

Burning index, pg. 25 or 35 (Table)

BI-1978  Narrative Comments
0-30     Most prescribed burns are conducted in this range.
30-40    Generally represents the limit of control for direct attack methods.
40-60    Machine methods are usually necessary, or indirect attack should be used.
60-80    The prospects for direct control by any means are poor above this intensity.
80-90    The heat load on people within 30 feet of the fire is dangerous.
90-110+  Above this intensity, spotting, fire whirls, and crowning should be expected.
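The cell below does this binning with `pd.cut` and one-hot encodes the result; the mapping itself can be illustrated with `np.digitize` using the notebook's merged bin edges (the 30-40 and 40-60 ranges combined into one class):

```python
import numpy as np

# Bin edges: classes are 0-30, 30-60, 60-80, 80-90, 90+ (higher = more dangerous)
edges = np.array([30, 60, 80, 90])
bi = np.array([12.0, 45.0, 85.0, 130.0])   # example burning-index values
classes = np.digitize(bi, edges)           # class labels 0..4
```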
In [10]:
# merge the 30-40 and 40-60 ranges into a single class
bins = [-1, 30, 60, 80, 90, max(extreme['burning_index_g'])+1]
labels = np.arange(len(bins)-1)
danger = to_categorical(pd.cut(extreme['burning_index_g'],
                               bins=bins, labels=labels))
extreme = pd.concat([extreme, pd.DataFrame(danger)], axis=1)
sns.countplot(x=np.argmax(danger, axis=1), palette=sns.color_palette('pastel'))
Out[10]:
<Axes: ylabel='count'>
[figure: count plot of the burn-index danger classes]

Sample¶

In [11]:
S = 100
fireS = fire.sample(n=100000, random_state=S)
extremeS = extreme.sample(n=100000, random_state=S)

Exploratory Data Analysis¶

table of contents

In [12]:
aa = df.groupby('day').mean(numeric_only=True).drop(columns='fire')
bb = pd.DataFrame(StandardScaler().fit_transform(aa),
                  columns=aa.columns, index=aa.index)
bb = bb[cols].rolling(15).mean().reset_index()

px.line(bb, x=bb.day, y=cols).update_layout(margin=dict(t=0, b=0, r=0, l=0))
In [13]:
col = ['burning_index_g', 'relative_humidity', 'air_temperature', 'wind_speed']
n = ['fire_length', 'start_month']
fig1 = px.histogram(fireS, x=col, histnorm='percent')
fig2 = px.histogram(x=[len(x) for x in getFireRange(firePer)],
                    histnorm='percent')
fig2.update_traces(dict(name=n[0], offsetgroup=n[0], legendgroup=n[0],
                        marker_color=px.colors.qualitative.Plotly[5],
                        showlegend=True))
fig3 = px.histogram(x=firePer.ALARM_DATE.dt.month, histnorm='percent')
fig3.update_traces(dict(name=n[1], offsetgroup=n[1], legendgroup=n[1],
                        marker_color=px.colors.qualitative.Plotly[6],
                        showlegend=True))

fig = go.Figure(data=fig1.data+fig2.data+fig3.data)
fig.update_layout(margin=dict(t=15, b=0, r=0, l=0),
                  barmode='overlay', bargap=0.1)
fig.update_traces(visible='legendonly', opacity=0.7)
fig.data[0].visible = True
fig.show()

Modeling¶

Model 1: Classification Using Fire Perimeter¶

Idea: predict whether each coordinate had a fire on a given day
Data: target: indicator for fire, from the dates and coordinates given by the fire perimeters
features: weather of the previous 3 days and others
Metric: F1, which considers precision and recall but not true negatives (no fire);
maximize true positives (correctly predicted fires), because predicting no fire when there is one is dangerous
Issue: perimeters include every coordinate that burned at any point during a fire;
for fires that span multiple days, some coordinates may not burn until days after the start,
so the label still says fire even when the weather conditions there are normal

table of contents

In [14]:
target = ['fire']
temp = fireS.drop(columns=['burning_index_g', 'relative_humidity',
                           'air_temperature', 'wind_speed'])
X1, X_test1, y1, y_test1, X_train1, X_val1, y_train1, y_val1 = split(
    temp, target)

print(f'{np.mean(y1) * 100:.2f}% yes fire\n')

mm1 = model(X_train1, y_train1, X_val1, y_val1, method='c')
4.17% yes fire

Logistic Regression -  Decision Tree -  LDA -  LightGBM -  XGBoost -  
F1 TPR FPR Precision logloss
XGBoost 0.677863 0.595174 0.006955 0.787234 0.845023
LightGBM 0.616260 0.508043 0.006086 0.783058 0.945145
Decision Tree 0.523975 0.505362 0.018315 0.544012 1.371661
Logistic Regression 0.169022 0.761394 0.313377 0.095063 11.183545
LDA 0.015584 0.008043 0.001043 0.250000 1.517838

Refining Model: XGBoost¶

The model is not expected to perform much better, because of the labeling issue mentioned above

Class Imbalance

  • majority (0/no fire) and minority (1/yes fire)
  • severity of imbalance doesn't include enough 'yes fire' info for training
  • accuracy is misleading: a model that predicts every sample as 'no fire' would still score accuracy equal to the proportion of no's (~0.96)
  • Possible solutions:
    • adjust class weights to assign higher weights to minority class
    • over and under-sampling with SMOTE to synthesize new samples for minority

For the training data

  • SMOTE oversample minority class
  • Under sample majority class
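A quick numeric illustration of the imbalance point above, before the imblearn pipeline is applied: a degenerate classifier that always predicts "no fire" scores high accuracy while catching zero fires (synthetic data with roughly the ~4% positive rate seen earlier):

```python
import numpy as np

rng = np.random.default_rng(100)
y = (rng.random(10_000) < 0.04).astype(int)   # ~4% "yes fire" labels
pred = np.zeros_like(y)                        # degenerate "always no fire" model

accuracy = (pred == y).mean()                  # high, ~= the proportion of no's
tpr = pred[y == 1].mean()                      # 0.0: not a single fire is caught
```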
In [15]:
pipeline = Pipeline(steps=[('over', SMOTE(sampling_strategy=0.1)),
                           ('under', RandomUnderSampler(sampling_strategy=0.5))
                           ])
X_train1_, y_train1_ = pipeline.fit_resample(X_train1, y_train1)

xgb1 = XGBClassifier(eval_metric='aucpr', random_state=100)
xgb1.fit(X_train1_, y_train1_)
_, _, _, _, _ = score(y_val1, xgb1.predict(X_val1), method='c', p=True)
f1: 0.5789, TPR: 0.8901, FPR: 0.0512, Precision: 0.4289

Model 2: Classification Using Burn Index Classes¶

Idea: predict the risk of fire based on prior nearby conditions
Data: target: categorized burn index
features: weather from 2 days before and the worst nearby weather conditions
Metric: F1
Issue: the categorization may not be descriptive enough, but it may be useful for identifying next-day dangers, assuming fires are environmentally caused or continuous

table of contents

In [16]:
target = list(labels)
temp = extremeS.drop(columns='burning_index_g')
X, X_test, yC, y_testC, X_train, X_val, y_trainC, y_valC = split(temp, target)

y_trainC = np.argmax(y_trainC, axis=1)
pipeline = Pipeline(steps=[('over', SMOTE()),
                           ('under', RandomUnderSampler())])
X_train_, y_trainC_ = pipeline.fit_resample(X_train, y_trainC)

y_trainC = pd.DataFrame(to_categorical(y_trainC), index=X_train.index)
y_trainC_ = pd.DataFrame(to_categorical(y_trainC_), index=X_train_.index)


def heat(true, pred, ax, title):
    print(title, end=': ')
    print(f"F1={f1_score(true, pred, average='weighted'):.3f}", end=' ')
    print(f'Acc={balanced_accuracy_score(true, pred):.3f}')
    pre, rec, f1b, sup = precision_recall_fscore_support(
        true, pred, labels=(labels))
    comp = pd.DataFrame([pre, rec, f1b, sup], index=['pre', 'rec', 'f1',
                                                     'sup']).add_prefix('class')
    display(comp)
    sns.heatmap(pd.crosstab(true, pred,
                            rownames=['Actual'], colnames=['Predicted'],
                            margins=True, normalize='index'
                            ), cmap='Blues', annot=True, ax=ax)
    ax.set(title=title)
    return comp


fig, axs = plt.subplots(ncols=2, figsize=(15, 7))
# original
xgbC = XGBClassifier(eval_metric='aucpr', random_state=100)
xgbC.fit(X_train, y_trainC)
comp = heat(np.argmax(y_valC, axis=1), np.argmax(xgbC.predict(X_val), axis=1),
            axs[0], 'original')

# smote
xgbC = XGBClassifier(eval_metric='aucpr', random_state=100)
xgbC.fit(X_train_, y_trainC_)
comp = heat(np.argmax(y_valC, axis=1), np.argmax(xgbC.predict(X_val), axis=1),
            axs[1], 'smote')
plt.show()
original: F1=0.770 Acc=0.587
class0 class1 class2 class3 class4
pre 0.755582 0.805201 0.765748 0.771739 0.908163
rec 0.885100 0.770072 0.703981 0.154013 0.421801
f1 0.815229 0.787245 0.733567 0.256781 0.576052
sup 6423.000000 7037.000000 3868.000000 461.000000 211.000000
smote: F1=0.750 Acc=0.629
class0 class1 class2 class3 class4
pre 0.710569 0.813910 0.776124 0.519298 0.603960
rec 0.884478 0.743357 0.620217 0.321041 0.578199
f1 0.788043 0.777035 0.689467 0.396783 0.590799
sup 6423.000000 7037.000000 3868.000000 461.000000 211.000000
[figure: confusion-matrix heatmaps, original vs. SMOTE]

Model 3: Regression Using Burn Index¶

Idea: predict the spread of fire based on prior nearby conditions
Data: target: burn index
features: weather from 2 days before and the worst nearby weather conditions
Metric: RMSE
Issue: difficult to correctly predict zeros

table of contents

In [17]:
target = ['burning_index_g']
temp = extremeS.drop(columns=labels)
X, X_test, yR, y_testR, X_train, X_val, y_trainR, y_valR = split(temp, target)

mmR = model(X_train, y_trainR, X_val, y_valR, method='r')
plotRes(y_valR, mmR['XGBoost'][0].predict(X_val))
Linear Regression -  Decision Tree -  LightGBM -  XGBoost -  
rmse mae
XGBoost 13.303519 9.051157
LightGBM 13.791284 9.456384
Linear Regression 15.969732 10.757834
Decision Tree 17.770387 10.049889
[figure: residual plot for the XGBoost regression]

Model 4: Neural Network¶

Idea: predict the spread and risk of fire based on prior nearby conditions
Data: targets: burn index and categorized burn index
features: weather from 2 days before and the worst nearby weather conditions
Metric: RMSE, F1
Issue: same as models 2 and 3

table of contents

In [18]:
def modelNN(X_train, y_trainC):
    # normalize data
    normalizer = Normalization(axis=-1)
    normalizer.adapt(np.array(X_train))
    # input layer
    visible = Input(shape=(X_train.shape[1],))
    # hidden layers
    hidden1 = Dense(512, activation='relu',
                    kernel_initializer='he_normal')(normalizer(visible))
    hidden2 = Dense(512, activation='relu',
                    kernel_initializer='he_normal')(hidden1)
    # output layer
    outR = Dense(1, activation='linear')(hidden2)  # reg
    outC = Dense(y_trainC.shape[1], activation='softmax')(hidden2)  # class

    model = Model(inputs=visible, outputs=[outR, outC])
    model.compile(loss=['mse', 'categorical_crossentropy'], optimizer='adam',
                  metrics=['RootMeanSquaredError', 'F1Score'])
    return model
In [19]:
%%time
nn4 = modelNN(X_train, y_trainC)
history = nn4.fit(X_train, [y_trainR, y_trainC], validation_split=0.3,
                  epochs=50, batch_size=64, verbose=0)
y_predR, y_predC = nn4.predict(X_val, verbose=0)
y_predC = np.argmax(y_predC, axis=1)
y_valC = np.argmax(y_valC, axis=1)
plot_loss(history)
CPU times: user 1min 22s, sys: 25.3 s, total: 1min 47s
Wall time: 50.4 s
[figure: training and validation loss curves]
In [20]:
print('Regression:    ', end=' ')
print(f'MAE={mean_absolute_error(y_valR, y_predR):.3f}', end=' ')
print(f'RMSE={mean_squared_error(y_valR, y_predR, squared=False):.3f}')
print('Classification:', end=' ')

plotRes(y_valR, y_predR)

fig, axs = plt.subplots()
compNN = heat(y_valC, y_predC, axs, 'Classification')
Regression:     MAE=8.419 RMSE=12.219
Classification: 
[figure: residual plot for the neural network regression]
Classification: F1=0.787 Acc=0.604
class0 class1 class2 class3 class4
pre 0.878116 0.756366 0.744739 0.405405 0.784946
rec 0.844621 0.831604 0.704498 0.292842 0.345972
f1 0.861043 0.792203 0.724060 0.340050 0.480263
sup 6423.000000 7037.000000 3868.000000 461.000000 211.000000
[figure: confusion-matrix heatmap for the neural network classifier]
In [21]:
display(comp, compNN)
class0 class1 class2 class3 class4
pre 0.710569 0.813910 0.776124 0.519298 0.603960
rec 0.884478 0.743357 0.620217 0.321041 0.578199
f1 0.788043 0.777035 0.689467 0.396783 0.590799
sup 6423.000000 7037.000000 3868.000000 461.000000 211.000000
class0 class1 class2 class3 class4
pre 0.878116 0.756366 0.744739 0.405405 0.784946
rec 0.844621 0.831604 0.704498 0.292842 0.345972
f1 0.861043 0.792203 0.724060 0.340050 0.480263
sup 6423.000000 7037.000000 3868.000000 461.000000 211.000000
In [22]:
comp.loc['f1', 'class2':].values, compNN.loc['f1', 'class2':].values
Out[22]:
(array([0.68946688, 0.39678284, 0.59079903]),
 array([0.72406005, 0.34005038, 0.48026316]))

Bootstrap¶

table of contents

In [23]:
def perform_bootstrap(X_test, y_testR, y_testC, models: dict, samp=500):
    np.random.seed(100)
    for bs_iter in range(samp):
        bs_index = np.random.choice(
            X_test.index, len(X_test.index), replace=True)
        bs_data = X_test.loc[bs_index]
        bs_testR = y_testR.loc[bs_index]
        bs_testC = y_testC.loc[bs_index]
        bs_testC = np.argmax(bs_testC, axis=1)
        for m, model in models.items():
            if model[0][1] == 'c':
                try:
                    bs_pred = model[0][0].predict(bs_data, verbose=0)
                    bs_pred = np.argmax(bs_pred[1], axis=1)
                except TypeError:  # sklearn models don't accept verbose
                    bs_pred = model[0][0].predict(bs_data)
                    bs_pred = np.argmax(bs_pred, axis=1)
                acc = balanced_accuracy_score(bs_testC, bs_pred)
                f1 = f1_score(bs_testC, bs_pred, average='weighted')
                model[1].append([acc, f1])
            elif model[0][1] == 'r':
                try:
                    bs_pred = model[0][0].predict(bs_data, verbose=0)
                    bs_pred = bs_pred[0]
                except TypeError:  # sklearn models don't accept verbose
                    bs_pred = model[0][0].predict(bs_data)

                model[1].append([*score(bs_testR, bs_pred, method='r')])
    return models
In [24]:
%%time
m = {
    'XGBClassifier': [[xgbC, 'c'], []],  # 2
    'XGBRegressor': [[mmR['XGBoost'][0], 'r'], []],  # 3
    'Neural Network Class': [[nn4, 'c'], []],  # 4C
    'Neural Network Reg': [[nn4, 'r'], []],  # 4R
}

bs = perform_bootstrap(X_test, y_testR, y_testC, models=m, samp=500)
CPU times: user 22min 48s, sys: 4min 19s, total: 27min 8s
Wall time: 15min 10s
In [25]:
# Bootstrap: Classification
fig, ax = plt.subplots(nrows=1, ncols=2)
for i in [0, 2]:
    sns.histplot(np.array(list(bs.values())[i][1])[:, 0], edgecolor='none',
                 color=sns.color_palette()[i], alpha=0.4, ax=ax[0])
    sns.histplot(np.array(list(bs.values())[i][1])[:, 1], edgecolor='none',
                 color=sns.color_palette()[i], alpha=0.4, ax=ax[1])

fig.suptitle('Bootstrap: Classification')
ax[0].set_title('Accuracy')
ax[1].set_title('F1')
plt.legend(title='Model', labels=list(bs.keys())[0:3:2],
           bbox_to_anchor=(1, 1.02), loc='upper left')
plt.show()

# Bootstrap: Regression
fig, ax = plt.subplots(nrows=1, ncols=2)
for i in [1, 3]:
    sns.histplot(np.array(list(bs.values())[i][1])[:, 0], edgecolor='none',
                 color=sns.color_palette()[i], alpha=0.4, ax=ax[0])
    sns.histplot(np.array(list(bs.values())[i][1])[:, 1], edgecolor='none',
                 color=sns.color_palette()[i], alpha=0.4, ax=ax[1])

fig.suptitle('Bootstrap: Regression')
ax[0].set_title('RMSE')
ax[1].set_title('MAE')
plt.legend(title='Model', labels=list(bs.keys())[1:4:2],
           bbox_to_anchor=(1, 1.02), loc='upper left')
plt.show()
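Beyond plotting the distributions, the bootstrap samples can be summarized as percentile confidence intervals. A minimal sketch, with `samples` standing in for one model's score list (e.g. `np.array(bs['XGBClassifier'][1])`, simulated here):

```python
import numpy as np

# Simulated stand-in for 500 bootstrap [accuracy, f1] pairs
rng = np.random.default_rng(100)
samples = np.column_stack([rng.normal(0.70, 0.01, 500),   # balanced accuracy
                           rng.normal(0.78, 0.01, 500)])  # weighted F1

# 95% percentile confidence interval for each metric
for name, col in zip(['accuracy', 'f1'], samples.T):
    lo, hi = np.percentile(col, [2.5, 97.5])
    print(f'{name}: mean={col.mean():.3f}, 95% CI=({lo:.3f}, {hi:.3f})')
```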
[Figure: bootstrap classification score distributions (balanced accuracy, F1)]
[Figure: bootstrap regression score distributions (RMSE, MAE)]

Final Model: Neural Network¶

table of contents

Neural network scores are generally better than those of the XGBoost models:

  • higher F1 (accuracy is less important here)
  • lower RMSE and MAE

However, a closer look at the Model 2 heatmap shows that the SMOTE-based classifier predicts the last three classes, i.e. the ones indicating more severe fires, better. In this case, combining the neural network with XGBoost may yield a better classifier.
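One way to combine the two models is a weighted soft vote over their predicted class probabilities, leaning on whichever model is stronger per class. An illustrative sketch with made-up probabilities (in the notebook, the per-class weights come from the Model 2 and Model 4 F1 scores):

```python
import numpy as np

# Made-up predicted probabilities for 2 samples x 5 classes
p_nn = np.array([[0.60, 0.20, 0.10, 0.05, 0.05],
                 [0.10, 0.10, 0.20, 0.30, 0.30]])
p_xgb = np.array([[0.50, 0.30, 0.10, 0.05, 0.05],
                  [0.05, 0.05, 0.10, 0.35, 0.45]])

# Trust the NN on the first three classes, XGBoost on the last two
w_nn = np.array([0.8, 0.8, 0.8, 0.2, 0.2])
combined = p_nn * w_nn + p_xgb * (1 - w_nn)
print(np.argmax(combined, axis=1))  # → [0 4]
```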

In [26]:
target = list(labels)
temp = extreme.drop(columns='burning_index_g')
X, X_test, yC, y_testC, X_train, X_val, y_trainC, y_valC = split(temp, target)
target = ['burning_index_g']
temp = extreme.drop(columns=labels)
X, X_test, yR, y_testR, X_train, X_val, y_trainR, y_valR = split(temp, target)
In [27]:
%%time
model = modelNN(X, yC)
history = model.fit(X, [yR, yC], validation_split=0.4,
                    epochs=50, batch_size=128, verbose=0)
y_predR, y_predC = model.predict(X_test, verbose=0)
y_predC = np.argmax(y_predC, axis=1)
y_testC = np.argmax(y_testC, axis=1)
plot_loss(history)
CPU times: user 17min 39s, sys: 3min 22s, total: 21min 2s
Wall time: 7min 35s
[Figure: training and validation loss curves]
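The network fit above predicts both targets at once. A hypothetical sketch of a two-headed architecture in the spirit of `modelNN` (defined in the Model 4 section): a shared trunk feeding a regression head (burn index) and a softmax classification head (five burn-index classes). Layer sizes here are made up.

```python
import numpy as np
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model

def make_two_headed(n_features, n_classes=5):
    inputs = Input(shape=(n_features,))
    # Shared trunk (hypothetical sizes)
    x = Dense(64, activation='relu')(inputs)
    x = Dense(32, activation='relu')(x)
    # Two heads: continuous burn index and class probabilities
    reg = Dense(1, name='reg')(x)
    clf = Dense(n_classes, activation='softmax', name='clf')(x)
    m = Model(inputs, [reg, clf])
    m.compile(optimizer='adam',
              loss={'reg': 'mse', 'clf': 'categorical_crossentropy'})
    return m

m = make_two_headed(10)
r, c = m.predict(np.zeros((4, 10)), verbose=0)
print(r.shape, c.shape)  # (4, 1) (4, 5)
```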
In [28]:
print('Regression:    ', end=' ')
print(f'MAE={mean_absolute_error(y_testR, y_predR):.3f}', end=' ')
print(f'RMSE={mean_squared_error(y_testR, y_predR, squared=False):.3f}')
fig, axs = plt.subplots()
comp1 = heat(y_testC, y_predC, axs, 'Classification')
Regression:     MAE=5.102 RMSE=7.939
Classification: F1=0.862 Acc=0.733
            class0         class1         class2        class3       class4
pre       0.916142       0.905095       0.759467      0.702970     0.637260
rec       0.946931       0.811714       0.911091      0.184868     0.812715
f1        0.931282       0.855865       0.828398      0.292748     0.714372
sup  189809.000000  204949.000000  114162.000000  13058.000000  5804.000000
[Figure: classification heatmap (confusion matrix)]
In [29]:
target = list(labels)
temp = extremeS.drop(columns='burning_index_g')
X, X_test, yC, y_testC, X_train, X_val, y_trainC, y_valC = split(temp, target)

yC = np.argmax(yC, axis=1)
pipeline = Pipeline(steps=[('over', SMOTE()),
                           ('under', RandomUnderSampler())])
X_, yC_ = pipeline.fit_resample(X, yC)

yC_ = pd.DataFrame(to_categorical(yC_), index=X_.index)

# smote
fig, axs = plt.subplots()
xgbC = XGBClassifier(eval_metric='aucpr', random_state=100)
xgbC.fit(X_, yC_)
# weights depend on the F1 results from Models 2 and 4 (higher F1 gets more weight)
comp2 = heat(np.argmax(y_testC, axis=1),
             np.argmax(xgbC.predict_proba(X_test) * [0.2, 0.2, 0.2, 0.8, 0.8] +
                       model.predict(X_test, verbose=0)[1] * [0.8, 0.8, 0.8, 0.2, 0.2], axis=1),
             axs, 'Neural Network + SMOTE Classification')
plt.show()
Neural Network + SMOTE Classification: F1=0.870 Acc=0.757
           class0        class1       class2      class3      class4
pre      0.920011      0.904823     0.772666    0.728000    0.746606
rec      0.948527      0.817775     0.906993    0.367306    0.746606
f1       0.934051      0.859100     0.834458    0.488263    0.746606
sup  14357.000000  15415.000000  8795.000000  991.000000  442.000000
[Figure: Neural Network + SMOTE classification heatmap]
In [30]:
old = np.mean(comp1.loc['f1', 'class2':])
new = np.mean(comp2.loc['f1', 'class2':])
print(100*(new-old) / old)
12.738027803879405

Conclusion¶

main menu

Although the neural network performed best, a persistent error involving zero burn index appeared in every model tested. This error may stem from the unpredictability of short-lived and/or unnatural (e.g. human-caused) fires under GridMET's daily aggregate data: such fires can start abruptly regardless of weather conditions, so daily data cannot adequately predict them. A finer temporal resolution, on the order of minutes or seconds, may better detect sudden changes. Furthermore, the acquired data does not include any topographic information, which could be beneficial.

In [ ]:
# save notebook
display(Javascript('IPython.notebook.save_checkpoint();'))
# save notebook as html to eugpoon.github.io/projects
!jupyter nbconvert  wildfire.ipynb --to html
%mv "wildfire.html" "../eugpoon.github.io/projects/"
# restyle imports, clear output, replace file
!cleanipynb wildfire.ipynb
# restart kernel
display(HTML("<script>Jupyter.notebook.kernel.restart()</script>"))